Improved ESP-index: A Practical Self-index for Highly Repetitive Texts
Authors
Abstract
While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real-world repetitive texts remains a challenge. ESP-index is a grammar-based self-index based on the notion of edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees upper bounds on parsing discrepancies between different appearances of the same subtext in a text. Although ESP-index performs efficient top-down searches of query texts, it relies on binary searches to find appearances of variables for a query text, which slows down query searches. We present an improved ESP-index (ESP-index-I) that leverages the idea behind succinct data structures for large alphabets. While ESP-index-I retains the same efficiency as ESP-index for top-down searches, it avoids the binary searches by using fast rank/select operations. We experimentally test ESP-index-I on its ability to search query texts and extract subtexts from large-scale real-world repetitive texts, and we show that ESP-index-I performs better than other possible approaches.
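The rank/select operations named in the abstract can be illustrated with a toy bit vector. This is only a minimal sketch and not the data structure used by ESP-index-I: the class name `BitVector` and its methods are hypothetical, the alphabet is binary rather than large, and plain lookup tables stand in for a succinct representation.

```python
class BitVector:
    """Toy bit vector with constant-time rank/select (illustrative, not succinct)."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] holds the number of 1s in bits[0:i].
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        # ones[k] holds the position of the (k+1)-th 1 bit.
        self.ones = [i for i, b in enumerate(self.bits) if b]

    def rank1(self, i):
        """Return the number of 1s in bits[0:i]."""
        return self.prefix[i]

    def select1(self, k):
        """Return the position of the k-th 1 (k is 1-based)."""
        return self.ones[k - 1]


bv = BitVector([1, 0, 1, 1, 0, 0, 1])
print(bv.rank1(4), bv.select1(3))
```

Both queries are answered by a direct table lookup, which is the kind of access pattern that can replace a binary search over sorted positions.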
Similar resources
A compressed dynamic self-index for highly repetitive text collections
We present a novel compressed dynamic self-index for highly repetitive text collections. Signature encoding, an existing self-index of this type, has a large disadvantage of slow pattern search for short patterns. We obtain faster pattern search by leveraging the idea behind a truncated suffix tree (TST) to develop the first compressed dynamic self-index, called the TST-index, that supports not...
siEDM: an efficient string index and search algorithm for edit distance with moves
Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has...
Self-Index based on LZ77 (thesis)
Domains like bioinformatics, version control systems, collaborative editing systems (wiki), and others, are producing huge data collections that are very repetitive. That is, there are few differences between the elements of the collection. This fact makes the compressibility of the collection extremely high. For example, a collection with all different versions of a Wikipedia article can be co...
Self-Index Based on LZ77
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We int...